Clote et al. BMC Bioinformotics 2012, 13(Suppl 5):S6 
http://www.biomedcentral.eom/1 471 -21 05/1 3/S5/S6 



RESEARCH Open Access 



Maximum expected accuracy structural neighbors 
of an RNA secondary structure 

Peter Clote 1 ' 2 ' 3 ', Feng Lou 2 ', William A Lorenz 4 

From First IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 
2011) 

Orlando, FL, USA. 3-5 February 201 1 



^BMC 

Bioinformatics 



Abstract 

Background: Since RNA molecules regulate genes and control alternative splicing by allostery, it is important to 
develop algorithms to predict RNA conformational switches. Some tools, such as paRNAss , RNAshapes and 
RNAbor, can be used to predict potential conformational switches; nevertheless, no existent tool can detect 
general (i.e., not family specific) entire riboswitches (both aptamer and expression platform) with accuracy. Thus, the 
development of additional algorithms to detect conformational switches seems important, especially since the 
difference in free energy between the two metastable secondary structures may be as large as 15-20 kcal/mol. It 
has recently emerged that RNA secondary structure can be more accurately predicted by computing the maximum 
expected accuracy (MEA) structure, rather than the minimum free energy (MFE) structure. 

Results: Given an arbitrary RNA secondary structure S 0 for an RNA nucleotide sequence a = a h ..., a n , we say that 
another secondary structure 5 of a is a /c-neighbor of 5 0 , if the base pair distance between 5 0 and 5 is k. In this 
paper, we prove that the Boltzmann probability of all ^-neighbors of the minimum free energy structure S 0 can be 
approximated with accuracy s and confidence 1 - p, simultaneously for all 0 < k < K, by a relative frequency count 

over N sampled structures, provided that ^ > ^ ^ , where <D(z) is the cumulative distribution 

function (CDF) for the standard normal distribution. We go on to describe the algorithm RNAbor ME A, which for 
an arbitrary initial structure S 0 and for all values 0 < k < K, computes the secondary structure MEA(k), having 
maximum expected accuracy over all /c-neighbors of S 0 . Computation time is 0(n 3 • K 2 ), and memory requirements 
are 0(n 2 • K). We analyze a sample TPP riboswitch, and apply our algorithm to the class of purine riboswitches. 

Conclusions: The approximation of RNAbor by sampling, with rigorous bound on accuracy, together with the 
computation of maximum expected accuracy /c-neighbors by RNAbor ME A, provide additional tools toward 
conformational switch detection. Results from RNAbor ME A are quite distinct from other tools, such as RNAbor, 
RNAshapes and paRNAss, hence may provide orthogonal information when looking for suboptimal structures 
or conformational switches. Source code for RNAbor ME A can be downloaded from http://sourceforge.net/ 
projects/rnabormea/ or http://bioinformatics.bc.edu/clotelab/RNAborMEA/. 
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Background 

RNA secondary structure conformational switches play 
an essential role in a number of biological processes, 
such as regulation of viral replication [1] and of viroid 
replication [2], regulation of Rl plasmid copy number in 
E. coli by hok/sok system [3], transcriptional and transla- 
tional gene regulation in prokaryotes by riboswitches 
[4], regulation of alternative splicing in eukaryotes [5], 
and stress-responsive gene regulation in humans [6], etc. 
Due to the biological importance of conformational 
switches, several groups have developed algorithms that 
attempt to recognize switches - in particular, thermody- 
namics-based methods such as paRNAss[7], RNA- 
shapes[8], RNAbor[9], as well as an approach using 
the second eigenvalue of the Laplacian matrix [10]. 

Riboswitches are portions of the 5' untranslated region 
(UTR) of messenger RNAs, experimentally known to 
regulate genes in bacteria by allostery [4], and to regu- 
late alternative splicing of the gene NMT1 in the eukar- 
yote Neurospora crassa [5]. Riboswitches are composed 
of a 5' aptamer and a 3' expression platform. Since the 
aptamer binds to a specific ligand with high affinity (K D 
« 5 nM), thus triggering the conformational change of 
the expression platform upon ligand binding [11], its 
sequence and secondary structure tend to be highly con- 
served. In contrast, there is lower sequence for the 
expression platform, which forms a bistable switch, 
effecting gene regulation by premature abortion of tran- 
scription (as in guanine riboswitches [12]), or by seques- 
tering the Shine-Dalgarno sequence (as in thiamine 
pyrophosphate riboswitches [13]). Due to the conserved 
sequence and secondary structure within the aptamer, 
all existent algorithms (to the best of our knowledge), 
such as [14-16], attempt only to detect riboswitch apta- 
mers, without the expression platform. In addition to 
these specific algorithmic approaches, more general 
computational tools that rely on stochastic context free 
grammars, such as Infernal [17] and CMFinder[18], 
have been trained to recognize riboswitch aptamers; in 
particular, Infernal was used to create the Rfam data- 
base [19], which includes 14 families of riboswitch 
aptamers. 

Since current riboswitch detection algorithms do not 
attempt to predict the location of the expression plat- 
form, we have developed tools, RNAbor- Sample and 
RNAborMEA, described in this paper, which yield infor- 
mation concerning alternative, or suboptimal, structures 
of a given RNA sequence. These tools can suggest the 
presence of a conformational switch; however, much 
more work must be done to actually produce a ribos- 
witch gene finder, part of the difficulty due to the fact 
that riboswitch aptamers contain pseudoknots that can- 
not be captured by secondary structure. 



In previous work [20,21], we described a novel pro- 
gram RNAbor to predict RNA conformational switches. 
For a given secondary structure S of a given RNA 
sequence s, the secondary structure T of s is said to be 
a /^-neighbor of S, if the base pair distance between S 
and T is k. (Base pair distance is the minimum number 
of base pairs that must be either added or removed, in 
order to transform the structure S into T.) Given an 
arbitrary initial structure S 0 , for all values 0 < k < K, the 
program RNAbor [20], computes the secondary structure 
MFE(k), having minimum free energy over all /c-neigh- 
bors of S 0 . (Note that K < 2 ♦ n, since the base pair dis- 
tance between any two secondary structures of a length 
n RNA sequence is at most 2 • n.) As well, RNAbor 
computes for each value 0 < k < K, the Boltzmann prob- 
ability p k = ^ffl, where Z(k) is the sum of all Boltzmann 
factors exp(-E(S)/RT) of all structures S having base pair 
distance k from an initially given structure S 0 , and 
where the partition function Z is the sum of all Boltz- 
mann factors of all secondary structures of the given 
RNA sequence. Here E(S) is the free energy of second- 
ary structure S, with respect to the Turner energy 
model [22,23], R = 0.001987 kcal mol" 1 IC 1 is the uni- 
versal gas constant, and T is absolute temperature. In 
the case that S 0 is the minimum free energy structure, 
the existence of one or more 'peaks', or values k » 0, 
where p^ is relatively large, suggests that there are two 
or more low energy structures having large base pair 
distance k from S 0 - i.e., a potentially distinct meta- 
stable structure, as shown in Figure 1. 

In [24], Do et al. introduced the notion of maximum 
expected accuracy (MEA) secondary structure, determined 
as follows: (i) compute base pairing probabilities p(U j) 
using a trained stochastic context free grammar; (ii) com- 
pute probabilities q{i) = 1 - .p{hj) - .PCM) 
that position i does not pair; (Hi) using a dynamic pro- 
gramming algorithm similar to that of Nussinov and 
Jacobson [25], determine that secondary structure S having 
maximum score J2(i,j)eS 2a ' P0''i) + ^unpaired Mi, where 
the first sum is over paired positions (/, ;) of S and the sec- 
ond sum is over positions i located in loop regions of «S, 
and where a, /? >0 are parameters with default values 1. 
Subsequently Kiryu et al. [26] computed the MEA struc- 
ture by replacing the stochastic context free grammar 
computation of base pairs in (i) by using McCaskilFs algo- 
rithm [27], which computes the Boltzmann base pairing 
probabilities 



Km) = 



{S:(i,j)eS} 



exp(-E(S)/RT) 



£ s exp(-E(S)/£T) 



(1) 



The sum in the numerator is taken over all secondary 
structures of the given RNA sequence, that contain base 
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base pair distance from MFE structure 

Figure 1 Output of RNAboron the 27 nt bistable switch with nucleotide sequence CUUAUGAGGG UACUCAUAAG AGUAUCC and initial 

structure S 0 , the minimum free energy (MFE) structure ((((((((....)))))))) with free energy -10.3 kcal/mol. The 16-neighbor of S 0 is the 

metastable structure ((((((((....)))))))) with free energy -9.9 kcal/mol. The MFE structure appears above the leftmost peak, while the MFE(\6) 

structure appears above the rightmost peak. The output of RNAbor includes a graph of the Boltzmann probabilities = and MFEik) 
structures, for all 0 < k < 2n. The existence of distinct 'peaks' suggests the presence of a conformational switch. 



pair (z, ;), while the sum in the denominator is taken 
over all secondary structures of the given RNA 
sequence. Thus p(i, j) is the sum of the Boltzmann fac- 
tors of all secondary structures that contain the fixed 
base pair (/, ;), divided by the partition function, which 
latter is the sum of Boltzmann factors of all secondary 
structures. In fact, Kiryu et al. [26] describe an algo- 
rithm to compute the MEA structure common to all 
RNAs in a given alignment. Later, Lu et al. [28] redis- 
covered Kiryus method; in addition, Lu et al. computed 
suboptimal MEA structures by implementing an analo- 
gue of Zukers method [29]. 

Our motivation in developing both RNAbor- Sample 
and RNAbor ME A, was to simplify and improve our pre- 
vious software, RNAbor, in detecting conformational 
switches. Since RNAbor computes the minimum free 
energy structure, MFE(k), over all structures having base 
pair distance k from an initially given structure S 0 , a 
complex portion of the code in RNAbor concerns the 
retrieval of free energy parameters from the Turner 
model [22,23]. The idea of RNAbor MEA was to compute 
the base pairing probabilities p(i, j) - see equation (1) - 
by McCaskill's algorithm using RNAf old, then to com- 
pute the maximum expected accuracy structure, MEA 
(/c), which needs no retrieval of energy parameters, and 
which we hoped would be very similar to the MFE(k) 



structure, in light of previously mentioned results 
[26,28]. Surprisingly, it turns out that MEA(k) structures 
are quite different from MFE(k) structures, as shown 
later in one of the figures. 

In this paper, we begin by showing rigorously how to 
approximate the output of RNAbor by frequency counts 
from sampling, using S fold [30]. We then extend the 
MEA technique to compute the maximum expected 
accuracy k-neighbor of a given RNA secondary structure 
S 0 ; i.e., that secondary structure which has maximum 
expected accuracy over all structures that differ from S 0 
by exactly k base pairs. By analyzing the family of purine 
riboswitches, obtained by retrieving full riboswitch 
sequences (aptamer and expression platform) from cor- 
responding EMBL genomic data, by extending the apta- 
mers from the seed alignment of Rfam family RF00167 
[31], we show that our software RNAbor ME A produces 
strikingly different results from other software that pro- 
duce suboptimal structures (RNAbor, RNAbor- Sam- 
ple, RNAlocopt, RNAshapes, UNAFold). 

Since the detection of computational switches remains 
an open problem, despite the success of some tools 
such as RNAshapes and RNAbor, we feel the addition 
of the tool RNAborMEA could prove useful, since it 
appears to be orthogonal to all other methods of gener- 
ating suboptimal secondary structures. 
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Results and discussion 

In this paper, we describe the following new results, dis- 
cussed in the 'Methods' section in greater detail with 
attendant definitions of unexplained concepts. 

1. We describe a Python script RNAbor- Sample 
that approximates the output p k = of RNAbor by 
frequency counts from sampled structures, for all 
0 < k < 2n, using Sf old[30], or RNAsubopt -p 
[32]. 

2. We prove that for any desired accuracy 0 < s and 
probability 0 < a < 1, if at least 

N{e,p,K)= 4 (2) 



structures are sampled, then 

P{\pk-Pk\ <s)>l-a (3) 



for all 0 < k < K; i.e., RNAbor- Sample furnishes 
estimates p^ of p k) for all 0 < k < K, which with con- 
fidence 1 - a are within s of the actual values p k . 
Here, is the cumulative distribution function 
(CDF) for the standard normal distribution. 
3. We develop an algorithm, RNAbor ME A, running 
in time 0(n 3 • K 2 ) and space 0(n 2 ♦ K), which com- 
putes simultaneously for all 0 < k < K, the maximum 
expected accuracy k-neighbors of a given RNA sec- 
ondary structure S 0 ; i.e., for each 0 < k < K, RNA- 
borMEA computes that structure S k which has 
maximum expected accuracy over all structures that 
differ from S 0 by exactly k base pairs. The algorithm 
RNAbor ME A additionally computes, for each 0 < k < 
K, the pseudo partition function 

Z fe = J2 exp{MEA{S)/BT). 

{S:d BP (S,S 0 )=k} 



Moreover, RNAbor ME A allows the user to stipulate 
(partial) hard constraints, that stipulate whether par- 
ticular nucleotides are unpaired, or base-pair with 
certain other nucleotides. The implementation of 
hard constraints follows ideas from Mathews [33], 
albeit suitably modified to simultaneously consider 
all /^-neighbors, for 0 < I< < K. 

We now describe the 13 figures and 4 tables, corre- 
sponding to computational experiments performed with 



RNAbor- Sample and RNAborMEA. These tables and 
figures are only briefly described, and we refer the 
reader to the captions of the figures and tables, which 
explain the results in greater detail. 

Figure 1 illustrates the presence of two peaks, corre- 
sponding to the Boltzmann probability of each of the 
metastable structures for a 27 nt bistable switch pre- 
viously considered by Hofacker et al. Figure 2 displays 
the Boltzmann probabilities p k from RNAbor, Boltz- 
mann probabilities estimates fa from RNAbor- Sample 
for the SAM riboswitch aptamer with GenBank acces- 
sion code AP004597.1/11894-11904. Clearly, probability 
estimates fa are close to actual values p k . The figure 
additionally shows probabilities r k from our software 
RNAlocopt[34], computed by r k = §jj§p where Z(LO) 
is the sum of Boltzmann factors of all locally optimal 
secondary structures, and Z k (LO) is the sum of all 
locally optimal /^-neighbors of S 0 . A secondary structure 
S is said to be locally optimal, if its energy does not 
decrease by the addition or removal of a single (valid) 
base pair; i.e., E(S U {(*, y)}) > E(S), and E(S - {(*, y)}) > 
E(S). Figure 3 displays the experimentally determined 
GENE ON and GENE OFF structures of an XPT gua- 
nine riboswitch from B. subtilis, taken from [35]. Figure 
4 shows the outputs of RNAborMEA, RNAbor, and 
RNAshapes, which are most similar to the GENE ON 
structure from the previous Figure 3. Figures 5 and 6 
determine the structural simlarity, as measured by the 
program NestedAl ign[36], between that structure 
output by RNAborMEA (as well as structures output by 
RNAbor, RNAbor- Sample, RNAlocopt, RNAshapes, 
and UNAFold), which are most similar to the XPT pur- 
ine riboswitch, displayed in Figure 3. Figure 5 deter- 
mines the structural similarity to the GENE ON 
structure (left panel of Figure 3), while Figure 6 deter- 
mines the structural similarity to the GENE OFF struc- 
ture (right panel of Figure 3). None of the structural 
neighbors, or sampled structures, are identical to the 
GENE ON or GENE OFF structures; however, there are 
some candidates that bear some resemblance to those 
structures. At this point, we can say that RNAbor- Sam- 
ple and RNAborMEA are methods that generate subop- 
timal structures, some of which may be similar to the 
metastable structures of a conformational switch; how- 
ever, much additional work is necessary before a robust 
method can be developed to detect conformational 
switches. 

Figure 7 shows that the MEA(k) structural neighbors, 
as computed by RNAborMEA, are very different than the 
MFE(k) structural neighbors, as computed by RNAbor. 
At present, such computational experiments show RNA- 
borMEA computes suboptimal structures, which seem 
to share (chimeric) similarities between parts of low 
energy structures, but which themselves do not have 
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-RNAbor-Sample 
-RNAIocopt 
RNAbor 



0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 

Figure 2 Boltzmann density plot for RNAbor, along with approximating relative frequency plots for RNAborMEAandRNAlocoptfor 
the 101 nt RNA sequence UACUUAUCAA GAGAGGUGGA GGGACUGGCC CGCUGAAACC UCAGCAACAG AACGCAUCUG UCUGUGCUAA 
AUCCUGCAAG CAAUAGCUUG AAAGAUAAGU U for the SAM riboswitch aptamter with GenBank accession code AP004597.1/1 18941 - 
1 19041. The program RNAbor computes the Boltzmann probability = where = ^2{S:d B p{S,S 0 )=k} ex P( — where S 0 is 
the initial structure (taken as the minimum free energy here). The script RNAbor-Sample calls Sfold on 1000 structures, in order to compute 
a relative frequence f k ~ p k of all /c-neighbors of 5 0 . Finally, we compute relative frequency of RNAIocopt [34], a program that samples only 
locally optimal secondary structures, having the property that one cannot obtain a lower energy structure by adding or removing a single base 
pair. 



very low energies. Such suboptimal structures appear to 
be 'orthogonal' to those output by all other methods, 
such as Sfold, RNAbor, RNAbor-Sample, RNAIo- 
copt, RNAshapes, UNAFold). Figure 8 displays the 
output of RNAborMEA, given the sequence of a TPP 
riboswitch with EMBL accession code AF269819/1811- 
1669. In this instance, RNAborMEA found two low 
energy structures having large base pair distance from 
each other. (Other computational experiments did not 
yield such a good example.) Figure 9 displays the free 
energy and maximum expected accuracy scores, for 
each of the /c-neighbors of the given TPP riboswitch 
sequence, just described in Figure 8. Figures 10 and 11 
present the pseudocode for the RNAborMEA algorithm, 
which given an RNA sequence a lf .. ,,a n and initial 
structure S 0 , computes the MEA(k) structure and pseudo 
partition function z fe , for each 0 < k < K in time 0(n 3 • 
K 2 ) and space 0(n 2 • K). Fi gure 12 presents pseudocode 
for the 0(n 2 ) algorithm to sample structures from the 
ensemble of structures having high MEA scores - a 
maximum expected accuracy analogue of the sampling 



algorithm Sfold[30]. Figure 13 displays the pseudo- 
Boltzmann probabilities pu = y ^ or two sma ^ RNA 
sequences. While temperature T has a natural signifi- 
cance, when computing Boltzmann probabilities p fe = ^, 
there is no natural meaning of temperature T, when 
computing pseudo Boltzmann factors exp(MEA(S)/RT), 
and indeed very different curves can be obtained with 
different temperatures. 

We now briefly describe Tables 1, 2, 3, 4. Table 1 pro- 
vides some sample sizes N, computed by the formula 
from equation (2), for an s approximation of Boltzmann 
probabilities p h 0 < k < K, with 1 - a confidence level. 
Tables 2 and 3 provide the numerical values for the ear- 
lier described Figures 5 and 6, where the NestedA- 
lign structural similarity is computed for the most 
similar /c-neighbor, determined by RNAborMEA, RNA- 
bor-Sample and RNAIocopt. Table 4 presents the 
number of times that each of the methods RNAborMEA, 
RNAbor, RNAbor-Sample, RNAIocopt, RNA- 
shapes, UNAFold output the most similar structure 
to the GENE ON resp. GENE OFF structure for the 
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XPT purine riboswitch described in Figure 3. This com- 
putational experiment was performed for all RNA 
sequences in the seed alignment of the Rfam purine 
riboswitch family RF00167 [31]. This table shows that 
RNAborMEA and RNAbor both outperform any other 
method in determining structures similar to the GENE 
OFF XPT structure; however, RNAborMEA uniquely 
outperforms all methods, including RNAbor, in deter- 
mining structures similar to the GENE ON XPT struc- 
ture. One of the reasons for this excellent result is that 
unlike other methods, RNAborMEA does not look for 



RNAborMEA: 




low energy structures, but rather for maximum expected 
accuracy structures. 

The figures and tables show, in summary, that RNA- 
borMEA provides useful suboptimal structures, which 
may be closer to metastable structures of a conforma- 
tional switch than more traditional methods, which rely 
on searching for low energy structures. 

Conclusions 

We have applied the notion of maximum expected accu- 
racy within the context of structural neighbors of a 



Figure 3 GENE ON (left) and GENE OFF (right) secondary structures for the 148 nt. XPT guanine riboswitch from B. subtilis with sequence 
CACUCAUAUA AUCGCGUGGA UAUGGCACGC AAGUUUCUAC CGGGCACCGU AAAUGUCCGA CUAUGGGUGA GCAAUGGAAC CGCACGUGUA 
CGGUUUUUUG UGAUAUCAGC AUUGCUUGCU CUUUAUUUGA GCGGGCAAUG CUUUUUUU. Sequence and secondary structure taken from [35]. 
The free energy of GENE ON resp. GENE OFF secondary structrure is -16.46 kcal/mol resp. -22.6 kcal/mol. Free energies were determined using 
RNAeval and figures produced using RNAplot, both from the Vienna RNA Package [40]. 



Global alignment score : 87.5 

CACUCAUAUAAUCGCGUGGAUAUGGCACGCAAGUUUCUACCGGGCACCGUAAAUGUCCGACUAUGGGUGAGCAAUG- - - -GAA-CCGCACGUGUACGGUUUUUUGUGAUAUCAG- -CAUUGCUUGCUC- - • -UUUAUUUGAGCGGGCAAUGCUUUUUUU 

UUACAAUAUAAUAGGAAC ACUC AUAUAAUCGCGUGGAUAUGGC ACGCAAGUUUCUACCGGGCACCGUAAAUGUCCGACUAUGGGUG A • GCAAUGGAACCGCACGUGUACGGUUUUUUGUGAUAUCAGCAU UGCUUGCUCUUUAUUUGAGCGGGCAAUGCU - - UUUU 

(((...((((((( ))))))) (((((( )>>>))..)))<(<<(<(<<<<<----.(<-((< ))))) ))))))))))))•-•• 

..((<•(•( (..(.((((((...(.((<(( ))))).) (((((( ))))))..)))(((((•---.■((((.{((.((( )))»)).)))((.(....)))) ))))))((( ))).)))))))).)--.... 



RNAbor: 

Global alignment score : 68 

C ACUCAUA - UA - AU - CGCGUGGAUAUGGC ACGCA - AGU - UUCUAC - C - G GGCACCGUAAAUGUC --C-G — AC- UAU GGGUGAGCAAUGGAACCGCACGUGUACGGUUUUUUGUGAUAUCAGCAUUGCUUGCUCUUUAUUUGAGCGGGCAAUGCUUUUUUU 

UUAC AAUAUAAUAG - - 6A - A - - - C ACUCAUA - UAAUC - GCGUGGAUAOGGCA - CGC AAGUUUCUACCGGGC ACCGUAAAUGUCCGACUAUGGGUGAGC AAUGGAACCGC ACGUGUACGGUUUUUUGUGAUAUC AGCAUUGCUUGCUCUUUAUUUGAGCGGGCAAUGC - - UUUUU 

(((-..-.(•(((((( ))))))•)..• •(•(••••(((( ))))••)•) .-Ill ((((<(((((((.((((( ))))) )))))))))))) 

((((..(<((..((- .))..))-))...-((.{((.((.((.(-(....)).))))))).))...)))).((((({((..(((((((((((((((((( ))))))....((...)).)))))))))))) )))>))))....-- 



RNAshapes: 

Global alignment score : 64 

• - -CACUCAU- A UAAUCGCGUGGAUAUGGCACGCAAGUUUCUACCGGGCACCGUAAAUGUCCGACUAU-GGGUGAGCAAUGGAACCGCACGUGUACGGUUUUUUGUGAUAUCAGCAUUGCU-UGCUC UUUAUUUGAGCGGGCAAUGCUUUUUUU 

UUACAAU ■ AUAAUAGGAACACUCAUAUAAUCGCGUGGAUAUGGCACGCAAGUUUCUACCGGGCACCGUAAAUGUCCGACUAUGGG- UGAGCAAUGGAACCGCACGUGUACGGUUU U • UUGUGAUA - UCAGCAUUGCUUGCUCUUUAUUUGAGCGGGCAAUGC - • UUUUU 

— ((( ...<<<<<<< ))))))) (((((( ))))))..)))•(((((<((((((.((((( ))))) )))))))•)>))) 

-....<<(<(<( ((((( ))))>..))>)))).<(<((< )))))) ((•((.((((.((((((( )))))) )-))))..))-)).<((((({ (((((( >))>))>>))>))-- 

Figure 4 Given riboswitch sequence X83878/1 68-267 and initial structure S 0 , the minimum free energy structure, a structure output 

by RNAborMEAis most structurally similar to the XPT GENE ONstructure, as measured by NestedAlign[36]. The NestedAlign score 
for RNAborMEA is 87.5, while optimal score for RNAbor is 60.0, and for RNAshapes is 64.0. 
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— ♦— RNAbor 
-■-RNAborMEA 

A RNAbor-Sample 
— W— RNAIocopt 
— *— RNAshapes 
UNAFold 



Figure 5 For each RNA sequence in the seed alignment from Rfam family RF00167 of purine riboswitch aptamers, we retrieved 
downstream flanking residues from the appropriate EMBL files, in order to ensure likelihood that the expression platform was 
included. Then the following six programs were run: RNAbor, RNAborMEA, RNAbor-Sample, RNAIocopt, RNAshapes, UNAFold. Each 
program outputs a number of near-optimal secondary structures, each according to different criteria. Taking RNAbor and RNAborMEA as 
examples, the programs RNAbor and RNAborMEA were run, in order to compute the MFE{k) structure and the MEA{k) structure, which have 
minimum free energy resp. maximum expected accuracy among all /(-neighbors of the intial minimum free energy structures 5 0 . Subsequently, we 
applied the program NestedAlign described in [36] to compute the structural similarity between the experimentally determined GENE ON 
structure for XPT guanine riboswitch of B. subtilis; i.e. the left panel of Figure 3. (Similar structures have positive scores; dissimilar structures have 
negative scores.) For each RNA in the seed alignment of RF00167, we determined the value k h such the MEA{k{) structure for that RNA has the 
greatest structural similarity with the XPT GENE ON structure, as determined by NestedAlign. (See the left panel of Figure 3 for the 
experimentally determined GENE ON structure of XPT.) As earlier explained, we performed similar computations for the programs RNAshapes 
[39] and UNAFold [41], the programs RNAborMEA and RNAbor-Sample, described in this paper, and programs RNAbor [9] and RNAIocopt 
[34], developed by our lab. In 21 out of 34 instances, RNAborMEA produced the secondary structure most structurally similar to the 
experimentally determined XPT GENE OFF structure, as measured by NestedAlign. 



given RNA sequence a l7 .. ., a n and structure S 0 . Our 
software RNAborMEA not only computes the structures 
MEA(k) having maximum expected accuracy over all 
structures S, whose base pair distance d BP {S 0 , S) is equal 
to k. In addition, RNAborMEA allows the user to enter 
structural constraints, which specify partial secondary 
structures required of all MEA(k) structures, if so 
desired. Additionally, RNAborMEA computes an analo- 
gue of the temperature-dependent partition function, 
defined by 

Z fe (T) = exp{cr{S))/RT 



{S:d BP {S 0 ,S)=k} 



and 



Z(T) = ^Z fe = £exp(a(S))/RT. 



Here, the expected accuracy score a is defined by 
a{S) = 2- Pi,j+ * 

(ij')eS iunpaired 

where first sum is taken over all base pairs (i, /) 
belonging to S, and the second sum is taken over all 
unpaired positions in S, and where p if j [resp. qj\ is the 
probability that r, / are paired [resp. i is unpaired] in the 
ensemble of low energy structures, as computed by 
McCaskiirs algorithm [27]. Finally, RNAborMEA allows 
the user to sample structures from the maximum 
expected accuracy ensemble, in a fashion analogous to 
Ding-Lawrence sampling from the low energy Boltz- 
mann ensemble, as in S fold [30]. 

Our preliminary investigations have not indicated a 
clear application of the partition function analogue, 
though it may be construed to provide a representation 
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Figure 6 For each RNA sequence in the seed alignment from Rfam family RF00167 of purine riboswitch aptamers, we retrieved 
downstream flanking residues from the appropriate EMBL files, in order to ensure likelihood that the expression platform was 
included. Then the following six programs were run: RNAbor, RNAborMEA, RNAbor- Sample, RNAlocopt, RNAshapes, UNAFold. Each 
program outputs a number of near-optimal secondary structures, each according to different criteria. Taking RNAbor and RNAborMEA as 
examples, the programs RNAbor and RNAborMEA were run, in order to compute the MFE{k) structure and the MEA{k) structure, which have 
minimum free energy resp. maximum expected accuracy among all /c-neighbors of the intial minimum free energy structures S 0 . Subsequently, we 
applied the program NestedAlign described in [36] to compute the structural similarity between the experimentally determined GENE OFF 
structure for XPT guanine riboswitch of B. subtilis; i.e. the right panel of Figure 3. (Similar structures have positive scores; dissimilar structures have 
negative scores.) For each RNA of the seed alignment of RF00167, we determined the value k h such the MEA{k-\) structure for that RNA has the 
greatest structural similarity with the XPT GENE OFF structure, as determined by NestedAlign. (See the right panel of Figure 3 for the 
experimentally determined GENE OFF structure of XPT.) As earlier explained, we performed similar computations for the programs RNAshapes 
[39] and UNAFold [41], the programs RNAborMEA and RNAbor -Sample, described in this paper, and RNAbor [9] and RNAlocopt [34]. In 22 
out of 34 instances, RNAborMEA produced the secondary structure most structurally similar to the experimentally determined XPT GENE OFF 
structure, as measured by NestedAlign. 



of the temperature-dependent mixing of various struc- 
tures having large score a. On the other hand, in com- 
putational experiments reported in the Results Section, 
it appears that RNAborMEA produces near-optimal 
structures that are closer to the biologically functional 
structures, in the case of conformational switches that 
are difficult to predict by any method. 

Indeed, in 18 [resp. 11] out of 34 instances, RNAbor- 
MEA produced the secondary structure most structurally 
similar to the experimentally determined XPT GENE 
ON [resp. GENE OFF] structure, as measured by Nes- 
tedAlign[36]. See Table 4. Since there appears to be 
little to no correlation between the structures MFE(k) 
output by RNAbor [20] and the structures MEA(k) out- 
put by our current program RNAborMEA, it appears 
that RNAborMEA yields a signal that is orthogonal and 
complementary to that provided by state-of-the-art ther- 
modynamics software, such as UNAFold, RNAfold, 
RNAstructure, Sfold, RNAshapes, RNAbor, etc. 
For these reasons, we feel that RNAborMEA has a cer- 
tain value, along with the programs UNAFold, RNA- 
fold, RNAstructure , Sfold, RNAshapes, 
RNAbor, etc. when producing suboptimal structures. 
RNAborMEA is written in C and available at http:// 



sourceforge.net/projects/rnabormea/ and http://bioinfor- 
matics.bc.edu/clotelab/RNAborMEA/. 

Methods 

Preliminaries 

Recall the definition of RNA secondary structure. 

Definition 1 A secondary structure S on RNA 
sequence a lf .. ., a n is defined to be a set of ordered pairs 
{U j), such that 1 < i < j < n and the following are satis- 
fied. 

1. Watson-Crick or GU wobble pairs: If {U j) belongs 
to S, then pair {a h aj) must be one of the following 
canonical base pairs: (A, £/), {U, A), (G, C), (C, G), 
(G, U)t (U, G). 

2. Threshold requirement: If (r, belongs to S, then j 
- i > 6, where 6, generally taken to be equal to 3, is 
the minimum number of unpaired bases in a hairpin 
loop; i.e., there must be at least 6 unpaired bases in 
a hairpin loop. 

3. Nonexistence of pseudoknots: If (/, /) and (/c, £) 
belong to S, then it is not the case that i < k < j <£. 

4. No base triples: If (/, ;) and (/, k) belong to S, then 
j = k; if {it j) and (/<, ;) belong to S, then i = k. 
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iz 

£ base pair distance to the initial secondary structure 

Figure 7 Figure depicting the increasing divergence between RNAborand RNAborMEA. For each RNA sequence in the seed alignment 
from Rfam family RF00066 of U7 small nuclear RNAs, both RNAbor and RNAborMEA were run, in order to compute the MFE{k) structure and 
the MEA{k) structure, which have minimum free energy resp. maximum expected accuracy among all /(-neighbors of the intial minimum free 
energy structures 5 0 . We computed the base pair distance between the MFE{k) structure and the MEA{k) structure over all sequences in the seed 
alignment of RF00066. The figure displays the average ± one standard deviation of base pair distance. 



The preceding definition provides for an inductive 
construction of the set of all secondary structures for a 
given RNA sequence a lf .. a n . For all values of d = 0, 

n and all values of i = 1, n - d, the collection 
Stf+d of all secondary structures for a b .. ., a i+d is defined 



as follows. If 0 < d < 9, then S^+d = {0}; i.e., the only 
secondary structure for a b .. ., a i+d is the empty struc- 
ture containing no base pairs (due to the requirement 
that all hairpins contain at least 6 unpaired bases). If d 
> 6 and Sjj has been defined by recursion for all i < j < 




Figure 8 Sample outputs from RNAborMEAon a 143 nt TPP-riboswitch, AF269819/181 1-1669 with sequence CUACUAGGGG 
AGCCAAAAGG CUGAGAUGAA UGUAUUCAGA CCCUUAUAAC CUGAUUUGGU UAAUACCAAC GUAGGAAAGU AGUUAUUAAC 
UAUUCGUCAU UGAGAUGUCU UGGUCUAACU ACUUUCUUCG CUGGGAAGUA GUU. We took the TPP riboswitch aptamer from the Rfam 
database [19], then extracted right-flanking nucleotides from the corresponding EMBL file, in order to include the expression platform. Displayed 
from left to right are the structures MEA{0) and M54(61) (the structure MEA{52) is similar to that of M£A(61) and corresponds to a free energy 
local minimum in the left figure.) The structure MEA{61) had the highest MEA score over all structural neighbors, including the original structure 
S 0 = MEA{0), and had free energy, -46.0 kcal/mol, that was equal to that of the initial structure S 0 = MEA{0), which is the minimum free energy 
structure for the given sequence. 
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TPP-nboswitch AF269819/181 1-1669 





0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100 

Base pair distance from MFE structure Base pair distance from MFE structure 

Figure 9 (Left) Free energy for all MEA{k) structural neighbors, 0 < k < 99, of the TPP-riboswitch, AF26981 9/1 81 1-1669, described in 
the previous figure. Clearly, MEA{0) and MEA(61) have the least energy, - 46.0 kcal/mol, and MEA(61) has the largest MEA score, 134.555, of all 
secondary structures for the given RNA sequence. It is more common that the free energy of the MEA{k) structure is monotonically increasing as 
a function of k. (Right) MEA score for all MEA{k) structural neighbors, 0 < k < 99, of the TPP-riboswitch, AF26981 9/1 81 1-1669, described in the 
previous figure. Clearly, MEA(61) has the largest MEA score, 134.555, of all secondary structures for the given RNA sequence. 



1. void RNAborMEA(s,5 0 ,M) 

2. //M(i,j, k) is the score of MEA k-neighbor of So 

3. initialize M(Lj. k) = 0 for all 1 < i.j < n, 0 < k < n 

4. compute pjj for all 1 < i < j < n (McCaskill's algorithm) 

5 . for ? = 1 to n 

6 • Qi = 1 - Ej>i PiJ ~ J2j<i Pj.i 

7. //qi is Boltzmann probability that i is unpaired 

8. for d = 0 to n— 1 // d is diagonal offset value 

9. for i = 1 to n — d 

10. j = i + d 

11. for k = 0 to n 

12. if j — i < 6 I/O unpaired bases in hairpin 

13. if k == 0 

15. else // k > 0 

16. break // for all k > 0 M(ij,k) = 0 

17. else if j - i == 0 + 1 

18. if (ij) e Sq then 

19. j,0) = 2orj + ZiZl+i far 

21. break //for k>l, M(ij,k) = 0 

22. else // 0 S 0 

23. M(ij,o) = j:i=iPqr 

24. if basePair(i, j) then 

25. Af(i,i,l) = 2ap ilj + E^i + i^r 

26. break //for other cases M(i,j,k) = Q 

27. else // j - i > 6> + 1 

Figure 10 Initial portion of pseudocode for RNAborMEAalgorithm, which continues in Figure 11. Given RNA sequence s = s 1; .. ,s n of 

length n, initial secondary structure 5 0 of s, RNAborMEA computes for all values of 0 < k < n that structure S with base pair distance k from 5 0 , 

which maximizes the value ^ft'j' ^) ~ XI ^- a Phj + XI The pseudocode actually computes only values M(/, /, /c) for all /', /', /c 

(i,j)eS iunpairedins 

the MEA structures are obtained by backtracing. This algorithm clearly runs in 0(n 5 ) time with 0(n 3 ) space. 
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27. 




else // j - i > 6 + 1 


28. 




max = 0 // M(i t j, k) = max of following 


29. 




// Case 1: j unpaired in S[i,j] 


30. 




bo = dBp(So[iJ-l],So[i,j]) 


31. 




//bo = 1 if j paired in S 0 [i,j], else 0 


32. 




val = M — 1, k - bo) + /3qj 


33. 




if vol > max then 


34. 




max = val 


35. 




index = (0.0.0) 


36. 




//backtracking: j unpaired 


37. 




// Case 2: (/'. j) G S 


38. 




if basePair{i, j) //check if i.j can pair 


39. 




bi = d B p(So[i + IJ - 1] U {(hj)},S 0 [hj]) 


40. 




val = M{% + 1, j - 1, k - bi) I + 2ap. Lj 


41. 




if val > max then 


42. 




max = val 


43. 




index = (/'. k - , 0) 


44. 




//backtracking: (i.j) G S 


45. 




// Case 3: (r. j) 6 S for some % < r < j 


46. 




for r =2 + 1 to j -9-1 


47. 




if basePair(r,j) 


48. 




b 2 = d B p(S 0 [i, r - 1] U S 0 [r + 1, j - 1] U {(r, j)}, SbM) 


49. 




for /to = 0 to fe — 62 


50. 




fcj = A- - 6 2 - fc 0 //A'o + fci + 6 2 = & 


51. 




uaZ = M(i, r - 1, A^)) + M(r + 1J - 1, fci) + 2r>p,, y 


52. 




if val > max then 


53. 




max = val 


54. 




index — (r. fco, fel) 


55. 




//backtracking: (r. j) G S 


56. 




M(i,j, k) = max 


57. 




M(j, z, /c) = index 


Figure 11 Pseudocode for RNAborMEAalgorithm. Given RNA sequence s = s 1; .. ., s n of length n, initial secondary structure S 0 of s, 


RNAborMEA computes for all values of 0 < k < 


: n that structure S with base pair distance k from 5 0 , which maximizes the value 


M{i,j,k) = J2 2a Pij 


+ £ Hi 

iunpaired in 5 


. The pseudocode actually computes only values M{i,j, k) for all i,j, k; the MEA structures are 


obtained by backtracing. This algorithm clearly runs in 0(n 3 ) time with 0(n 3 ) space. 



1. void traceback(i,jik,M,paren) 

2. //perform traceback on [i.j] for A—neighbors of 5o[i,j] 

3. if j - i > 6 and M{Lj. k) > 0 

4. (r,fc 0 ,fci) = 

5. if r > 0 //j pairs with r in 

6. paren[r] = ' ( ' 

//note that paren has dummy char at position 0 

7 . paren[j] = ' ) 3 

8 . traceback (r + 1 , j — 1 , fcl , M, paren) 

9. traceback(i,r — 1, fc 0 , M, paren) 

10. else //r = 0, so j not paired in 

11 . traceback(i,j — 1, fco, M, paren) 

12. return 

Figure 12 Pseudocode for the 0(n 2 ) traceback computed by our RNAborMEAalgorithm. Note that run time could be reduced to 0{n In n) 
by applying the boustrephedonic method described in [42]. 
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Base pair distance from native structure Base pair distance from minimum free energy structure 

Figure 13 (Left) Pseudo-Boltzmann and uniform probabilities of structural neighbors MEA{k) for the 49 nt SECIS sequence fdhA, with 
nucleotide sequence CGCCACCCUG CGAACCCAAU AAUAAAAUAU ACAAGGGAGC AAGGUGGCG and where S 0 is (((((((.(((... 

((( ))).))).)))))))• Here, the (formal) parameter RT taken to be 49 (length of sequence), in order to uniformize MEA scores to range 

between 0 and 1. The pseudo-Boltzmann probability is defined by P^(fe) = — , where (i) = Iexp(M£4(5)//?7), the sum being taken over all 
5 such that d BP (S 0 , 5) = k, and (ii) Z = S^Z^. The uniform probability is defined By P u (}z) = where N [k) is the number of /c-neighbors of 5 0 
and N is the total number of secondary structures. (Right) Pseudo-Boltzmann probabilities for MEA{k) structural neighbors of the 27 nt Vienna 

bistable switch with nucleotide sequence CUUAUGAGGG UACUCAUAAG AGUAUCC and initial (minimum free energy) structure ((((((((....)))))))). 

The left curve is when RT = 0.6, the approximate value obtained by multiplying the universal gas constant 0.00198 kcal/mol times 310 Kelvin. In 
contrast, the right curve is when RT = 27 (length of sequence). Though not shown in this graph, the pseudo-Boltzmann distribution is identical 
with the uniform distribution, when RT = n, where n is sequence length. 



i + d, then 

♦ Any secondary structure of a if .. ., a i+d _i is a sec- 
ondary structure for a u .. a i+d , in which a i+d is 
unpaired. 

♦ If a ^ cij can form a Watson-Crick or wobble base 
pair, then for any secondary structure S for a i+1 , .. 
a i+d-i> the structure S U {(/, ;)} is a secondary struc- 
ture for a b a i+d . 

♦ For any intermediate value i + 1 < r < j - 6 - 1, if 
a r , cij can form a Watson- Crick or wobble base pair, 



Table 1 Number of samples needed for high-confidence 
approximation of Boltzmann probabilities 



p 


K 


s 


z 


N 


0.05 


1 


0.01 


1.45 


9506 


0.05 


100 


0.01 


3.48 


30276 


0.05 


1000000 


0.01 


5.45 


74256 


0.001 


100 


0.01 


3.89 


37830 


0.000001 


100 


0.01 


5.73 


82082 


0.05 


1 


0.001 


1.45 


950600 


0.05 


100 


0.001 


3.48 


3027600 



The number KJ _ js\ _ ^ \2k) of samples sufficient to 



guarantee that \f k - p k \ < e with confidence 1 - p, for all 0 < k < K, in the 

application RNAbor- Sample. Here = the Boltzmann probability, as 

computed exactly by RNAbor, for a /c-neighbor of S 0 , and f k is the relative 
frequency of /c-neighbors among N sampled structures. 



then for any secondary structure S for a b .,a r _i and 
any secondary structure T for a r+1 , a^, the struc- 
ture S U T U {(r, ;)} is a secondary structure for a b 

Given two secondary structures «S, T, we define the 
base pair distance between S, T, denoted by d BP («S, T), 
to be the cardinality of the symmetric difference of S, T; 
Le.,dBp(S, T) = \(S-T)V(T-S)\. 
RNAbor-Sample 

In this section, we describe how sampling from the 
Boltzmann ensemble, using Sf old[30], leads to a sim- 
ple and fast approximation of RNAbor computations, 
provided that the input consists of an RNA sequence 
and the minimum free energy (MFE) secondary struc- 
ture for that sequence. The work of this section is 
inspired by sampling approximations of the number of 
structural motifs, such as hairpins, multiloops, etc. of 
Ding and Lawrence [30], as well as the sampling 
approximation used in RNAshapes[8] for large 
sequences. A novelty of our work is that we provide a 
rigorous justification for the accuracy of the approxima- 
tion, depending on sample size. 

Let RNAbor-Sample denote the protocol, where we 
apply Sf old [30] to sample N secondary structures S of 
an input RNA sequence a lf .. .,a n , then subsequently 
compute the relative frequencies f k for 0 < k < K, where 
fk=u is defined to be the number of sampled 
structures S, whose base pair distance with Sq is /<, 
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Table 2 Comparison of NestedAlign similarity scores for the GENE ON structure of the XPT guanine riboswitch 



index 


EMBL 


RNAbor 


RNAborMEA 


RNAbor- Sample 


RNAlocopt 


RNAshapes 


UNAFold 


0 


AL591 981/205922-205823 


-9.0 


5.0 


-9.0 


-8.5 


-9.0 


-9.0 


1 


CP000764/271 074-271 1 75 


-43.5 


5.0 


-37.5 


-44.5 


-23.0 


-53.0 


2 


CP000764/308099-308200 


-27.0 


-18.0 


-24.5 


-31.5 


-25.5 


-22.0 


3 


BA000028/760473-760574 


-25.5 


-0.5 


-36.0 


-38.5 


-24.5 


-31.0 


4 


CP000557/252200-252301 


-9.5 


8.5 


-9.5 


8.5 


-10.0 


-12.0 


5 


X83878/1 68-267 


60.0 


87.5 


57.0 


66.0 


64.0 


59.0 


6 


BA000004/1 593074-1 592973 


35.0 


16.5 


-13.5 


-21.5 


-19.0 


-13.5 


7 


AAOX01 000023/1 9446-1 9345 


-15.0 


-2.0 


-13.0 


-18.5 


-13.5 


-15.5 


8 


CP00041 6/1 798040-1 7981 38 


5.5 


1.5 


1.5 


12.0 


4.5 


-4.5 


9 


CP000721 /398929-399026 


26.0 


24.5 


16.5 


-20.0 


21.5 


-32.0 


10 


BA000028/1 1 03943-1 1 04044 


1.0 


1.5 


2.0 


-0.5 


0.5 


0.5 


1 1 


ABDQ01 000002/25 1 055-25 1 1 52 


-16.0 


-2.5 


-16.5 


-21.5 


-17.5 


-22.5 


12 


AAXV0 1 000026/3 1 334-3 1 233 


1 1.5 


6.0 


-1.5 


-8.5 


22.0 


-3.0 


13 


AE01 6877/298774-298875 


-18.5 


14.0 


-17.5 


-34.0 


-12.0 


-26.5 


14 


BA000004/676475-676576 


-28.5 


-31.0 


-28.0 


-69.0 


-21.0 


-29.5 


15 


AE01 7333/692981-693082 


-1.5 


2.5 


-1 1.5 


-9.5 


-5.5 


-53.0 


16 


AM1 80355/25621 7-25631 8 


-17.0 


-45.0 


-45.5 


-49.0 


-48.0 


-49.0 


17 


AM406671/1 321 062-1 320965 


-25.5 


-15.0 


-22.0 


-28.5 


-23.5 


-23.5 


18 


CP00061 2/25981 1 1 -259801 2 


-42.0 


-39.5 


-42.0 


-47.5 


-39.0 


-38.5 


19 


CP000002/697032-6971 34 


-8.0 


-1 1.0 


-10.5 


-10.0 


-4.5 


-7.5 


20 


CP000002/2295936-2295837 


23.5 


47.0 


31.5 


21.0 


30.0 


22.5 


21 


AL5961 70/223345-223246 


-0.5 


7.0 


0.5 


-8.5 


-10.0 


-10.0 


22 


ABDQ01 000005/1 3 1 908-1 3 1 807 


-33.0 


-15.5 


-31.5 


-31.5 


-19.0 


-50.0 


23 


AAOX01 000052/9069-8968 


-13.5 


1.5 


-14.0 


-21.0 


-15.5 


-14.5 


24 


AE01 7333/4024324-4024425 


-29.5 


-26.5 


-33.5 


-24.0 


-23.5 


-36.0 


25 


AP006627/1 55471 7-1 55481 8 


-31.5 


-1.5 


-37.0 


-44.5 


-28.5 


-43.5 


26 


CP000024/1 182948-1 183043 


-0.5 


-18.5 


-9.0 


4.0 


2.0 


-19.0 


27 


BA000028/786767-786867 


-18.0 


-41.5 


-48.0 


-46.5 


-49.0 


-44.5 


28 


ABDP01 000002/29688-29587 


-34.5 


-42.5 


-34.5 


-37.0 


-35.0 


-50.0 


29 


BA000043/272473-272574 


-9.5 


4.0 


-9.5 


-10.0 


-3.0 


-12.5 


30 


CP000724/944285-944386 


-30.5 


-21.5 


-30.5 


-28.5 


-26.5 


-31.5 


31 


CP000764/1 409725-1 409826 


14.0 


-3.0 


-18.0 


-24.0 


-11.5 


-20.0 


32 


AAEK01 00001 7/86437-86538 


-44.5 


-44.0 


-41.5 


-52.0 


-35.0 


-49.0 


33 


CP000764/357645-357544 


11.0 


-13.5 


-33.0 


-26.0 


-18.5 


-36.0 



NestedAlign similarity scores between the GENE ON structure of the XPT guanine riboswitch of B. subtilis, experimentally determined using in-line probing 
(see [35]), and the structurally most similar secondary structure among near-optimal structures generated by each of the following six methods: 
RNAbor, RNAborMEA, RNAbor -Sample, RNAlocopt, RNAshapes, UNAFold. These values are plotted in Figure 5, where more details on the computational 
experiment are given. 



divided by N. Since Sf old appears to be only available 
as a web server, where the user can not stipulate the 
number of sampled structures, we instead use the 
Vienna RNA Package implementation of Sf old, given 
in RNA sub opt -p[32]. 

Let be an arbitrary RNA sequence having 

MFE structure of S 0 . Following [9], let Z k denote the 
sum of Boltzmann factors of all /^-neighbors of S 0 ; i.e., 

Z k = £ exp(-£(S)/RT). 

d BP (S 0 ,S)=k 



As usual, let Z denote the partition function, repre- 
senting the sum of Boltzmann factors of all secondary 
structures of a lf .. ., a n ; i.e., 

z = j2^M-m/RT) 

s 

and let pk = y denote the Boltzmann probability of all 
/c-neighbors. 

Given a desired approximation accuracy of s, a prob- 
ability p, and an upper bound 7<T, we wish to compute a 
value N = N(e, p, !<), such that whenever we sample at 
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Table 3 Comparison of NestedAlign similarity scores for the GENE OFF structure of the XPT guanine riboswitch 
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NestedAlign similarity scores between the GENE OFF structure of the XPT guanine riboswitch of B. subtilis, experimentally determined using in-line probing 
(see [35]), and the structurally most similar secondary structure among near-optimal structures generated by each of the following six methods: RNAbor, 
RNAborMEA, RNAbor- Sample, RNAlocopt, RNAshapes, UNAFold. These values are plotted in Figure 6, where more details on the computational 
experiment are given. 



Table 4 Number of times that the most similar structure was produced 



Method 



greatest similarity to gene on 



greatest similarity to gene off 



RNAborMEA 
RNAbor 

RNAbor- Sample 
RNAlocopt 
RNAshapes 
UNAFold 



11 
11 



Number of times that the most similar structure to the GENE ON resp. GENE OFF structure of the B. subtilis XPT riboswitch was produced by each of the six 
methods investigated. Although the test was made with 34 sequences from the seed alignment of Rfam family RF00167 [31], the sums of each column may 
exceed 34; this is because If two or more methods produced the maximum score, then each was counted. Structural similarity was measured using the 
NestedAlign structural alignment algorithm [36]. While the GENE OFF structure involves a terminator loop, that is often correctly found by thermodynamics- 
based software, the GENE ON secondary structure, having higher free energy (hence less stable thermodynamically) is less likely to be found using 
thermodynamics-based approaches. 
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least N secondary structures from the Boltzmann 
ensemble using Sf old, the relative frequency f k of k- 
neighbors sampled is within s of the probability p k of k- 
neighbors, for all 0 < k < K, with confidence level of (1 
- p). Formally, this means that for each 0 < k < K, 



P[\fk~Pk\ <e)>l-p. 



(4) 



Consider the value k as bin number. For a fixed bin /<, 
let us denote the exact value of ^ by p k . If we sample N 
structures, each falling in the /<th bin with probability p k) 
then the number of structures in the /<th bin is given by 
the random variable X k having binomial distribution 
with mean N • p k and variance N ♦ p k {l - p k ). It follows 
that the proportion ^ of structures in the /cth bin has 
mean \a = p k and standard deviation 
G = y^S^M < _J_ . To determine minimum sample 
size sufficient to ensure a certain approximation accu- 
racy with certain confidence interval, we adapt a stan- 
dard argument from statistics [37] (see equation (2435) 
on p. 529), by approximating the binomial distribution 
by the standard normal distribution using Z-scores. 

Before starting, we mention that it will suffice for our 
intended application of RNAbor- Sample to have a pre- 
cise approximation of those p k which exceed some mod- 
est lower bound, such as 3 = 0.01 or 3 - 0.0001. Thus 
we intend to prove that for all 0 < k < K, if p k > 3, then 
Equation (4) holds. 

Temporarily, we fix k. Let X be a Bernouilli random 
variable with success probability p k , corresponding to 
the indicator random variable that returns 1 if a single 
sampled secondary structure is a /^-neighbor of S 0 . Pro- 
vided that we sample a number N of structures, which 
satisfies N ♦ p k > 30, then the standard normal distribu- 
tion can be used to approximate the left and right tail of 
the distribution of Z-scores of sampled proportions 
f k = £'=i Xfe , defined by 



x — fi = fk-pk = V^jfk -pk) 
Jpk{l-Pk) 



Pk{i-Pk) 

N 



(5) 



Let O(z) = f^ oQ exp(-x 2 /2)dx denote the cumu- 
lative distribution function (CDF) for the standard nor- 
mal distribution. Given desired confidence interval of C 
= 1 - a, recall that the critical value z a n is defined by 



^/2 = 0- 1 (l-a/2) = |0- 1 (a/2)|. 

If s is the margin of error in the left tail (-°o, -z a / 2 ) and 
right tail (z a n> +°°) for the normal approximation of the 
binomial distribution, then by a well-known argument 
(e.g. equation (24.35) on p. 529 of [37]), we have 



Za/2 • 



Pk{l-Pk) 



N 



It follows that 



z 2 z 2 

N = N{a,e) = ^>^.p k {l-p k ) 

provides a sufficient lower bound on number of sam- 
ples necessary to guarantee margin of error s. Let a = -| 
and define 



N = N{e,p,K) = 



^ \2K) 2K 



As 2 



= 2K 

As 2 ' 



(6) 



We have just shown that for N > N(e, p, K), 
P{\z\ > lO- 1 ^)!) < f, hence 



2K 



[ ml 



■pk) 



< p -. 

K 



/ 



The following is now a key step. If we have K bins, 
and we desire to have a small probability p that we are 
off by more than s in our estimate of the probability of 
any bin (in our intended application, the /<th bin, for 0 < 
I< < K, is the collection of /^-neighbors of S 0 , i.e., those 
structures S, whose base pair distance with <S 0 is I<) then 
it suffices that we have a probability £ that we are off 
by more than s in any single bin. Indeed, let Y k denote 
the indicator random variable, with value 1 provided 
that [f k - p k \ > e, where f k is the relative frequency of 
sampling a /c- neighbor of S 0 , after sampling N secondary 
structures, where by Equation (5), N is chosen so that 



P(\z\ > s)=P 



then 



p/ VNQ/fc-ftl) 



> e 



£ 
K 



P{Y 0 v---vY K - l )<K-p/K = p. 

Putting everything together, we have shown that for 
given s, p, K, we can define by defining N 



N = N{s,p,K) 

we have 
/ 

[30 < k < K] 



4e 2 



(7) 



V 



\fk-pk\ 



> <t>~ L — 



Pk(l-pk) 

N 
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2K, 



<P 



J/ 
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We have completed a more rigorous argument using 
Chernoff bounds, but prefer the exposition given here 
for simplicity. Note that the same argument, given ver- 
batim, can be used for any binning procedure. In parti- 
cular, this approach provides information on sufficient 
number of samples in order to approximate the result of 
RNAshapes [8,38,39] by means of sampling. 

We can make some basic conclusions from the above 
analysis. The number of samples sufficient to ensure 
that \fk - Pk\ < 8 f° r 0 < k < K with confidence 1 - p is 
reasonable, and only slightly increases for a higher num- 
ber K of bins or to ensure greater confidence 1 - p. 
However, the number of samples increases greatly when 
higher precision estimates (smaller s values) are needed, 
even for one bin. 

In the case of one bin, it is important to remember 
that the value N is a conservative estimate, sufficient to 
ensure our conclusion. This estimate uses the worst 
case of 50% probability of being in a bin, which maxi- 
mizes the standard deviation. For bins with small prob- 
ability, one can re-estimate what is needed. However, 
for bins with smaller probability, higher precision is 
needed as well, unless all that is required is to verify 
that the bin has small probability. Also, clearly if a bin 
has probability of 10" 6 , then at least on the order of one 
million samples are needed, just for a reasonable prob- 
ability of sampling the bin once. 

Algorithm description 

Given an RNA sequence a = a lf .. ., a m a secondary 
structure S 0 of a, and a maximum desired value Kmax < 
n, the RNAborMEA algorithm computes, for each 1 < i 
< j < n and each 0 < k <Kmax < n, the maximum score 
M(i, j, k) 

2a Pv + J2 Mi 
(i,j)eS iunpaired 

where the first sum is taken over all base pairs (/, y) 
belonging to «S, the second sum is taken over all 
unpaired positions in S, and where p if j [resp. q t ] is the 
probability that /, j are paired [resp. i is unpaired] in the 
ensemble of low energy structures, and a, /3 >0 are 
weights. Our computational experiments, as in Figure 9, 
were carried out with default values of 1 for a, /3. (See 
Equation 1 for the formal definition of Boltzmann base 
pairing probability p it j.) 

The dynamic programming computation of M(i, y, k) 
is performed by recursion on increasing values of y - i 
for all values 1 < i < j < n and 0 < k < Kmax, The value 
of M(i, y, /<), stored in the upper triangular portion of 
matrix M, will involve taking the maximum over three 
cases, which correspond to the inductive construction of 
all secondary structures on a b .. ., ap as described in the 



previous section. At the same time, the value M(j, i, k), 
stored in the lower triangular portion of matrix M, will 
consist of a triple r, k 0 , k\ of numbers, such that the fol- 
lowing approximately holds (in this section, we provide 
the motivating idea; the actual algorithm description, 
which deviates slightly from the description here, is 
given in the next section and in Figures 10 and 11). (i) 
If r = 0 then M(i, y, k) is maximized by a /^-neighbor S 
of S 0 [i, y] for the subsequence a b .. ., Uj in which Uj is 
unpaired. In this case, k 0 - k and I<i - 0. (ii) If r = i, 
then M(i, y, k) is maximized by a /^-neighbor S of S 0 [i, y] 
for the subsequence a b ...,aj in which base pair (/, j) e S. 
In this case, k 0 = 0 and k\ = k - 1. (i) If / < r < y - 6 - 1 
then M(i, j> k) is maximized by a /^-neighbor S of S 0 [i, y] 
for the subsequence a b .. .,aj in which base pair (r, j) e 
S. The left portion of S, which is S[i, r - 1] will be a k 0 
neighbor of S[i, r - 1], while the right portion of S, 
which is 5[r, y] must contain the base pair (r, y) and itself 
be a ki neighbor of 5[r, y]. In summary, the values r, k 0 , 
ki will be used in computing the traceback, where the 
maximum expected accurate structure that is a /c-neigh- 
bor of S[i, y] will be constructed by one of the following: 

(i) MEA /^-neighbor of S[U j - 1], in the event that a,j is 
unpaired in [/, y]; (ii) MEA k - 1-neighbor of S[i + 1, j - 
1], in the event that a b Uj form a base pair; (Hi) MEA 
/c 0 -neighbor of S[i, r - 1] and the MEA /q-neighbor of S 
[r, y], where k 0 + k\ - /c, in the event that a r) Uj form a 
base pair. 

Pseudocode for the algorithm RNAborMEA is given in 
Figures 10 and 11. An array M of size n x n x Kmax is 
required to store the MEA scores in M(i, y, k) for all 
subsequences [/, y] and all base pair distances 0 < k < 
Kmax between structures S[U j] and initially given struc- 
ture S 0 [U /]. For 1 < / < y < n and all 0 < k < Kmax, the 
pseudocode in Figure 11 stores a value of the form (x, y, 
z) in the lower triangular portion, M(j, i, /<), of the array. 
Here, x - 0 indicates that the optimal structure on [/, y], 
i.e., having maximum MEA score over all k-neighbors of 
S 0 [i, y], is obtained by not pairing y with any nucleotide 
in [/, y]; for values x >0, hence x e [/, j - 0 - 1], the opti- 
mal /c-neighbor of S 0 [i, y] is obtained by pairing x with y. 
The values y, z correspond to the values /c 0 , ki, such 
that: (i) if x - 0, then the optimal /^-neighbor of 5 0 [/, j] 
is obtained by first computing the optimal /c 0 -neighbor 
of S 0 [i, j - 1], where k 0 = k - b 0 , then leaving y unpaired; 

(ii) if x = U then the optimal /c-neighbor of S 0 [i, y] is 
obtained by first computing the optimal ki -neighbor of 
S 0 [i + 1, y - 1], where ki - k - bi, then adding the 
enclosing base pair (/, y); (Hi) if x = r e [i + 1, j - 0 - 1], 
then the optimal /^-neighbor of S 0 [i, y] is obtained by 
first computing the optimal /c 0 -neighbor of S 0 [i, r - 1] as 
well as the optimal ki -neighbor of S 0 [r + 1, y - 1], then 
adding the base pair (r, j). This last calculation must be 
done over all values k 0 , k\ such that k 0 + ki - k. Using 



Clote et al. BMC Bioinformotics 2012, 13(Suppl 5):S6 
http://www.biomedcentral.eom/1 471 -21 05/1 3/S5/S6 



Page 17 of 18 



the values M(j, i, k) = (x, y, z), the traceback can be 
easily computed by recursion; see Figure 12 for pseudo- 
code of traceback. 

In a manner similar to the pseudocode of Figures 10 
and 11 (essentially, one replaces the operation of taking 
the maximum by the a summation, and one replaces the 
MEA score by the pseudo-Boltzmann factor exp(MEA 
(S)/RT)), RNAborMEA also computes the pseudo-Boltz- 
mann partition function values 

4? = E exp(MEA(S/RT)). 

{SeS^ B p(S 0 ,S)=fc) 

We have graphed the Boltzmann fe) probabilities as 
well as the uniform probabilities ^m, where is"the 
number of /^-neighbors, and N 1>n is the total number of 
secondary structures. When RT = n, which normalizes 
the MEA score to a maximum of 1, it appears that the 
Boltzmann distribution is the same as the uniform dis- 
tribution, as shown in Figure 13. 
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